
Conversation

@hjc4869 (Contributor) commented May 8, 2025

This change addresses issue #13241.

It also includes llama-bench support for the new option, to help with performance tuning.
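
For example, the effect can be measured with llama-bench by sweeping the option on and off (the model path below is a placeholder; full benchmark commands are given later in this thread, and note the option was later renamed to --no-op-offload):

./build/bin/llama-bench -m /path/to/model.gguf -ngl 999 -fa 1 -ot 'exps=CPU' --disable-op-offload 1,0 -n 0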

github-actions bot added the testing (Everything test related), examples, and ggml (changes relating to the ggml tensor library for machine learning) labels on May 8, 2025
@Panchovix commented:

Hi there! I tried this PR (applied the changes on top of the latest commit, to use MLA + FA) on DeepSeek V3 0324, but I noticed lower PP performance. I think it doesn't saturate the PCIe link of the main GPU when using the flag (versus without it), which results in lower PP t/s.

The command I ran was:

./llama-server -m '/GGUFs/DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf' -c 16384 --no-mmap --no-warmup -v -ngl 99 --override-tensor 'blk\.([0-7])\..*_exps\.=CUDA0' --override-tensor 'blk\.([8-9]|1[0-1])\..*_exps\.=CUDA1' --override-tensor 'blk\.(1[2-6])\..*_exps\.=CUDA2' --override-tensor 'blk\.(1[7-9]|2[0-6])\..*_exps\.=CUDA3' -fa --override-tensor 'blk\..*_exps\.=CPU' -mg 0 --ubatch-size 1024 --disable-op-offload

RX/TX usage without the flag (GPU 0 gets saturated) while doing PP: [screenshot]

RX/TX usage with the flag while doing PP: [screenshot]

So PP drops from 66 t/s to 26 t/s.

Without the flag:

prompt eval time =   35950.29 ms /  3218 tokens (   11.17 ms per token,    89.51 tokens per second)
       eval time =   44338.15 ms /   380 tokens (  116.68 ms per token,     8.57 tokens per second)

With the flag:

prompt eval time =  122421.67 ms /  3218 tokens (   38.04 ms per token,    26.29 tokens per second)
       eval time =   49715.68 ms /   440 tokens (  112.99 ms per token,     8.85 tokens per second)

Maybe I'm using an incompatible flag?

@hjc4869 (Contributor, Author) commented May 9, 2025

Using the flag places much more load on the CPU, and whether that's beneficial at all is highly specific to the model offload params and hardware config.

For now I have personally only tested llama 4 and qwen 3, with a relatively performant CPU (7970X) and a single relatively weak GPU (W7900), using a simple exps=CPU -ot config, so YMMV. There is already a huge variance in perf uplift between the two models I tested.

./build/bin/llama-bench -m ~/models/llama4-400b-hybrid-q8_0-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 --disable-op-offload 1,0 -n 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon PRO W7900 Dual Slot , gfx1100 (0x1100), VMM: no, Wave Size: 32

| model | size | params | backend | ngl | type_k | type_v | fa | ot | mmap | dopo | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: | -: | --- | ---: | ---: | --: | ---: |
| llama4 17Bx128E (Maverick) Q8_0 | 216.56 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | 1 | pp512 | 232.02 ± 1.04 |
| llama4 17Bx128E (Maverick) Q8_0 | 216.56 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | 0 | pp512 | 27.44 ± 0.04 |

./build/bin/llama-bench -m ~/models/qwen3-235b-a22b-q8_0-q4_0-hybrid.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 --disable-op-offload 1,0 -n 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon PRO W7900 Dual Slot , gfx1100 (0x1100), VMM: no, Wave Size: 32

| model | size | params | backend | ngl | type_k | type_v | fa | ot | mmap | dopo | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: | -: | --- | ---: | ---: | --: | ---: |
| qwen3moe 235B.A22B Q8_0 | 127.02 GiB | 235.09 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | 1 | pp512 | 61.27 ± 0.13 |
| qwen3moe 235B.A22B Q8_0 | 127.02 GiB | 235.09 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | 0 | pp512 | 43.00 ± 0.09 |

@Panchovix commented:

Ah, that could be the reason then. I have 192 GB of RAM with a Ryzen 7 7800X3D, which is a consumer CPU, so it's pretty weak for these tasks.

@jukofyork (Collaborator) commented May 10, 2025

Could we not just parameterise the fixed batch-size threshold of 32 used to decide when to offload? That would still let you disable offloading by setting it to a very large value, but would also allow more nuanced settings.

@slaren (Member) commented May 10, 2025

A setting to control the minimum batch size would need to be per-backend, and configured via an environment variable.

@jukofyork (Collaborator) commented:

> A setting to control the minimum batch size would need to be per-backend, and configured via an environment variable.

Ah, sorry, I forgot that the 32 limit I was thinking of is CUDA-specific.
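
To make the idea concrete, here is a rough sketch (hypothetical function and environment-variable names only, not the actual ggml code): each backend could read its own minimum batch size from an environment variable and use it in its offload decision, keeping the current CUDA default of 32 when the variable is unset.

// Rough sketch with hypothetical names, not the actual ggml API.
#include <cstdlib>

// Read a per-backend minimum batch size for op offload from an environment
// variable (the name is hypothetical), falling back to the current CUDA
// default of 32.
static int op_offload_min_batch(const char * env_name) {
    const char * val = std::getenv(env_name);
    if (val != nullptr && *val != '\0') {
        return std::atoi(val);
    }
    return 32; // current hard-coded CUDA threshold
}

// Offload an op to the device only when its batch dimension reaches the
// threshold. Setting the variable to a huge value effectively disables
// offloading for that backend, similar to what --no-op-offload does globally.
static bool should_offload_op(int batch_size) {
    static const int min_batch = op_offload_min_batch("GGML_CUDA_OP_OFFLOAD_MIN_BATCH");
    return batch_size >= min_batch;
}

Reading the variable once and caching the result would keep the per-op check cheap.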

@hjc4869 changed the title from "Add --disable-op-offload to improve -ot pp perf in MoE models like llama4 400B" to "Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B" on May 11, 2025
@slaren merged commit 7f323a5 into ggml-org:master on May 11, 2025 (1 check passed)
@hjc4869 deleted the no_op_offload branch on May 11, 2025 at 13:24
@iSevenDays commented:

FYI: using --no-op-offload on a Dell R740 with two Nvidia 4090D 48G GPUs slows down prompt processing a lot.

With the flag:

prompt eval time =   78715.84 ms /  5480 tokens (   14.36 ms per token,    69.62 tokens per second)
eval time =   36350.14 ms /   195 tokens (  186.41 ms per token,     5.36 tokens per second)
total time =  115065.99 ms /  5675 tokens

Without the flag:

prompt eval time =  268849.41 ms / 75130 tokens (    3.58 ms per token,   279.45 tokens per second)
eval time =    6960.09 ms /    39 tokens (  178.46 ms per token,     5.60 tokens per second)
total time =  275809.51 ms / 75169 tokens

Successfully merging this pull request may close these issues:

Feature Request: Allow disabling offload_op for backends by user
